GH-50162: [C++][Parquet] Avoid int32 overflow in BitPackedRunDecoder::GetBatch offset by metsw24-max · Pull Request #50089 · apache/arrow

metsw24-max · 2026-06-04T06:40:22Z

int32 overflow in the bit-packed run decoder offset
GetBatch computes the byte position as values_read_ * value_bit_width in 32-bit int. For a large bit-packed run (this path decodes untrusted Parquet RLE/bit-packed dictionary indices and levels, with value widths up to 64), the product exceeds INT32_MAX and wraps negative, so bytes_fully_read goes negative and unread_data ends up pointing before the buffer, giving an out-of-bounds read in unpack. raw_data_size just above already widens to int64 before the same multiply, so I matched that here.
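To make the failure mode concrete, here is a minimal sketch of the two ways of computing the offset. It is not the Arrow source: `BitOffset`, `ComputeOffset` and `WrappedProduct32` are hypothetical names, and the wrapped value is simulated in 64-bit arithmetic because actually overflowing a signed int is undefined behaviour in C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the decoder's offset bookkeeping (not Arrow code).
struct BitOffset {
  int64_t bytes_fully_read;  // whole bytes consumed so far
  int bit_offset;            // remaining bits inside the next byte, 0..7
};

// Widen one operand first so the multiply happens in 64 bits.
inline BitOffset ComputeOffset(int32_t values_read, int32_t value_bit_width) {
  assert(values_read >= 0 && value_bit_width >= 0);
  const int64_t bits_read = static_cast<int64_t>(values_read) * value_bit_width;
  return {bits_read / 8, static_cast<int>(bits_read % 8)};
}

// The value a 32-bit two's-complement product would hold after wrapping.
// Simulated with 64-bit math to avoid signed-overflow undefined behaviour.
inline int64_t WrappedProduct32(int32_t values_read, int32_t value_bit_width) {
  assert(values_read >= 0 && value_bit_width >= 0);
  const int64_t exact = static_cast<int64_t>(values_read) * value_bit_width;
  int64_t low = exact & static_cast<int64_t>(0xFFFFFFFFu);  // keep low 32 bits
  if (low >= (int64_t{1} << 31)) {
    low -= (int64_t{1} << 32);  // reinterpret the top bit as the sign bit
  }
  return low;
}
```

With 300,000,000 values of width 8, `ComputeOffset` reports 300,000,000 bytes read, while `WrappedProduct32` gives a negative bit count, which is the value that would have driven the pointer before the buffer.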

GitHub Issue: [C++][Parquet] int32 overflow in BitPackedRunDecoder::GetBatch offset for large bit-packed runs #50162

github-actions · 2026-06-04T06:41:38Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


pitrou · 2026-06-10T15:06:11Z

@metsw24-max Thanks for submitting this PR! Can you open a corresponding GitHub issue as instructed by the comment bot above?

@AntoinePrv Would you like to review these changes?

AntoinePrv

I think ideally we would want to check whether the untrusted int is within the bounds of the Parquet spec, and error if not (here it may be returning a negative value?).

Though is the bit_width check done before? If not, should it be?

AntoinePrv · 2026-06-11T08:51:41Z

+        /* .bit_offset= */ static_cast<int>(bits_read % 8),
+        /* .max_read_bytes= */ static_cast<int>(max_read_bytes_ - bytes_fully_read),


In the same way that it previously might not fit in an int, do we know that it will fit at this point (and not turn negative)?

For parser-produced runs it does hold: PeekImpl only emits a run whose whole payload fits in the remaining buffer (it truncates or rejects the run otherwise), and the BitPackedRun constructor DCHECKs that invariant. Since values_read_ <= values_count_, bytes_fully_read <= max_read_bytes_, so the difference stays in [0, max_read_bytes_], which is itself an rle_size_t. The remaining case is the negative sentinel (no bound), where the difference stays negative and unpack treats any negative value the same as -1. Added a DCHECK at the subtraction site to make that explicit.
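The argument above can be sketched as a small helper. This is an illustration under assumptions, not the code in the PR: `RemainingReadBytes` is a hypothetical name, `rle_size_t` is assumed to be a 32-bit signed integer, plain `assert` stands in for `DCHECK`, and the sentinel case is normalised to -1 here, whereas the reply above says the real code simply lets the difference stay negative.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this sketch: rle_size_t is a 32-bit signed integer.
using rle_size_t = int32_t;

// Hypothetical helper: bytes still available to unpack after
// bytes_fully_read bytes of the run have been consumed.
inline int RemainingReadBytes(rle_size_t max_read_bytes, int64_t bytes_fully_read) {
  if (max_read_bytes < 0) {
    // Negative sentinel: the run has no byte bound, so keep it negative.
    return -1;
  }
  // Bounded run: the parser guarantees the payload fits in the buffer,
  // so the consumed bytes can never exceed the bound.
  assert(bytes_fully_read >= 0);
  assert(bytes_fully_read <= max_read_bytes);
  // The difference lies in [0, max_read_bytes], so narrowing is lossless.
  return static_cast<int>(max_read_bytes - bytes_fully_read);
}
```

For example, a bound of 100 bytes with 40 bytes consumed leaves 60, and an unbounded run stays negative no matter how many bytes have been read.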

github-actions · 2026-06-12T12:32:57Z

⚠️ GitHub issue #50162 has been automatically assigned in GitHub to PR creator.

metsw24-max · 2026-06-12T12:34:44Z

@pitrou done, opened GH-50162 and renamed the title to match.

@AntoinePrv on the spec bounds: the bit width is already checked before it reaches this decoder. Dictionary indices reject anything above 32 in DictDecoderImpl::SetData, and for rep/def levels the width isn't read from the file at all, it's derived from max_level. Run lengths are bounded by the parser, which truncates or rejects a run that would overflow the buffer. The catch is that the values overflowing here are all within spec: a single bit-packed run can validly hold close to 2^31 values, so values_read_ * value_bit_width passes INT32_MAX on legitimate data once a run grows past 256 MiB. So I don't think there's an out-of-spec value to error on at this level; the intermediate just needs the wider type, same as raw_data_size above. Happy to add an explicit error path if you'd rather have one.
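For reference, the 256 MiB figure follows directly from the arithmetic: INT32_MAX bits is just under 2^31 bits, and 2^31 bits is 2^28 bytes. A small sketch of that calculation, with hypothetical helper names that are not part of Arrow:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Smallest value count whose bit total exceeds INT32_MAX for a given width.
inline int64_t MinOverflowingValueCount(int value_bit_width) {
  assert(value_bit_width > 0);
  const int64_t int32_max = std::numeric_limits<int32_t>::max();
  return int32_max / value_bit_width + 1;
}

// Payload size in bytes of a bit-packed run, rounded up to a whole byte.
inline int64_t PayloadBytes(int64_t value_count, int value_bit_width) {
  assert(value_count >= 0 && value_bit_width > 0);
  return (value_count * value_bit_width + 7) / 8;
}
```

At width 32 the first overflowing count is 67,108,864 values, and at width 8 it is 268,435,456 values; both correspond to a payload of 268,435,456 bytes, i.e. 256 MiB. At width 1 the threshold is 2^31 values, which a 32-bit value count cannot reach.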

avoid int32 overflow in BitPackedRunDecoder::GetBatch offset

750aa76

github-actions Bot added the awaiting review label Jun 4, 2026

github-actions Bot added the Component: C++ label Jun 4, 2026

AntoinePrv reviewed Jun 11, 2026


github-actions Bot added the awaiting committer review label and removed the awaiting review label Jun 11, 2026

metsw24-max mentioned this pull request Jun 12, 2026

[C++][Parquet] int32 overflow in BitPackedRunDecoder::GetBatch offset for large bit-packed runs #50162

Open

metsw24-max changed the title ~~avoid int32 overflow in BitPackedRunDecoder::GetBatch offset~~ GH-50162: [C++][Parquet] Avoid int32 overflow in BitPackedRunDecoder::GetBatch offset Jun 12, 2026

assert the run payload bound before computing max_read_bytes

480bbb8

metsw24-max requested a review from pitrou as a code owner June 12, 2026 12:34

metsw24-max wants to merge 2 commits into apache:main from metsw24-max:rle-bitpacked-offset-overflow


3 participants
